Skip to content

feat(server): implement multi-tenancy provider with file and HTTP backends#979

Open
Pangjiping wants to merge 6 commits into
opensandbox-group:mainfrom
Pangjiping:feature/multi-tenancy-provider
Open

feat(server): implement multi-tenancy provider with file and HTTP backends#979
Pangjiping wants to merge 6 commits into
opensandbox-group:mainfrom
Pangjiping:feature/multi-tenancy-provider

Conversation

@Pangjiping
Copy link
Copy Markdown
Collaborator

@Pangjiping Pangjiping commented Jun 5, 2026

Summary

Wire TenantProvider interface through full request chain:

  • TenantProvider Protocol + FileTenantProvider + HTTPTenantProvider
  • Auth middleware multi-tenant/single-tenant mode branching
  • ContextVar-based tenant propagation (tenants/context.py)
  • Dynamic namespace resolution in KubernetesSandboxService
  • [tenants] config section in server.toml (provider=file|http)
  • Startup guards: docker+tenants fatal, api_key+tenants mutual exclusion
  • HTTP provider: per-key TTL cache, sync fetch, max_stale, 401 eviction
  • 27 e2e tests covering both providers end-to-end
  • Update OSEP-0012 with formal interface spec and HTTP contract
  • closes [Feature] 拆分 chatrs 让一个集群能承载多个命名空间沙箱 #972 沙箱能否支持多租户 #497

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

Pangjiping and others added 5 commits June 5, 2026 17:44
…kends

Wire TenantProvider interface through full request chain:
- TenantProvider Protocol + FileTenantProvider + HTTPTenantProvider
- Auth middleware multi-tenant/single-tenant mode branching
- ContextVar-based tenant propagation (tenants/context.py)
- Dynamic namespace resolution in KubernetesSandboxService
- [tenants] config section in server.toml (provider=file|http)
- Startup guards: docker+tenants fatal, api_key+tenants mutual exclusion
- HTTP provider: per-key TTL cache, sync fetch, max_stale, 401 eviction
- 27 e2e tests covering both providers end-to-end
- Update OSEP-0012 with formal interface spec and HTTP contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New script scripts/python-k8s-e2e-multi-tenant.sh with file/http variants
- Add multi-tenant-file and multi-tenant-http matrix entries to nightly workflow
- Helm chart: support tenantsToml value for mounting tenants.toml ConfigMap
- HTTP variant: mock Python provider on host, accessible via Kind bridge IP
- Both variants verify auth (valid key→200, invalid→401, missing→401)
  then run full SDK kubernetes-mini test suite with tenant key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use Kind control-plane container's gateway IP as primary detection,
with fallbacks. Add pre-deploy reachability check for the mock endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Pangjiping Pangjiping added documentation Improvements or additions to documentation component/server feature New feature or request labels Jun 5, 2026
Copy link
Copy Markdown
Collaborator Author

@Pangjiping Pangjiping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid architecture — TenantProvider protocol, ContextVar propagation, namespace-scoped isolation all look correct. One critical issue (sync HTTP blocking the event loop) should be fixed before merge; the rest are improvements.

Comment thread server/opensandbox_server/tenants/http_provider.py
Comment thread server/opensandbox_server/tenants/http_provider.py
Comment thread server/opensandbox_server/tenants/__init__.py Outdated
Comment thread server/tests/test_multi_tenancy.py Outdated
Comment thread server/opensandbox_server/tenants/models.py Outdated
- Fix sync HTTP blocking event loop: use asyncio.to_thread() in auth middleware
- Add per-key singleflight pattern to prevent thundering herd on TTL expiry
- Log warning at startup when HTTP tenant endpoint is not HTTPS
- Rename _resolve_tenants_path to resolve_tenants_path (public API)
- Use tuple for TenantEntry.api_keys for true frozen immutability
- Extract validate_tenant_config() so startup guards are testable
- Add [tenants] section to configuration.md, README, and example configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Pangjiping Pangjiping marked this pull request as ready for review June 6, 2026 07:24
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d02b0be6c3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 87 to 88
if self._is_proxy_path(request.url.path):
return await call_next(request)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Authenticate proxy requests before resolving tenant namespaces

In multi-tenant Kubernetes deployments, /sandboxes/{id}/proxy/... requests return before the tenant provider is consulted, so current_tenant is never set even when the client sends a valid tenant API key. The proxy handler calls lifecycle.sandbox_service.get_endpoint(...), and the new Kubernetes service resolves that through _resolve_namespace(), which falls back to the default namespace when no tenant context exists; sandboxes created under a tenant namespace will therefore be reported missing or proxied from the wrong namespace. The proxy exemption needs to be conditional or still establish tenant context in multi-tenant mode.

Useful? React with 👍 / 👎.

Comment on lines +135 to +138
- name: tenants
mountPath: /etc/opensandbox/tenants.toml
subPath: tenants.toml
readOnly: true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Mount tenants ConfigMap without subPath for reloads

The file provider is documented and implemented as a hot-reloading tenants.toml, but the Helm chart mounts the ConfigMap key via subPath; Kubernetes documents that ConfigMap mounts using subPath do not receive updates when the ConfigMap changes. As a result, updating tenantsToml in a Helm release will not change /etc/opensandbox/tenants.toml in the running server pod, so key rotation/add/remove will not be observed until the pod restarts.

Useful? React with 👍 / 👎.

Comment on lines 709 to +711
workload = self.workload_provider.get_workload(
sandbox_id=sandbox_id,
namespace=self.namespace,
namespace=self._resolve_namespace(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate tenant context into renew workers

When renew_intent is enabled in multi-tenant Kubernetes mode, renew work is processed by the background RenewIntentConsumer tasks, not in the authenticated request context. Those workers only carry sandbox_id, so this new _resolve_namespace() call falls back to the configured default namespace and cannot find or renew sandboxes that were created in a tenant namespace; the renew work item needs to include the tenant namespace or otherwise set tenant context before calling the service.

Useful? React with 👍 / 👎.

Comment on lines +174 to +180
if not is_leader:
event.wait(timeout=self._config.timeout_seconds)
with self._lock:
cached = self._cache.get(api_key)
if cached:
return cached.tenant
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve HTTP lookup errors for waiters

For concurrent cold lookups of the same valid API key, only the leader performs the HTTP request while the other requests wait here; if the leader times out, gets a 5xx, or simply takes longer than this timeout, the waiters see no cache entry and return None, which the auth middleware converts to 401 INVALID_API_KEY. That makes provider outages or slow lookups look like invalid credentials for some callers instead of returning the same 503/tenant result as the leader, so the singleflight path should propagate the leader's result or error to waiters.

Useful? React with 👍 / 👎.

Comment on lines +63 to +64
_tenant_provider = FileTenantProvider(_tenants_path)
_tenant_provider.start()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate file tenants before starting

The multi-tenancy design added in oseps/0012-multi-tenancy.md says startup must validate that all tenant namespaces exist and are accessible, but file-provider startup only parses the TOML and starts the watcher. With a misspelled or unauthorized namespace the server comes up healthy and every request for that tenant fails later during workload operations, so the file-provider path should check the loaded namespaces against Kubernetes before accepting the config.

Useful? React with 👍 / 👎.

Comment on lines +97 to +99
def lookup(self, api_key: str) -> Optional[TenantEntry]:
with self._lock:
return self._lookup.get(api_key)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use constant-time checks for file tenant keys

In file-provider mode, every authenticated request resolves the presented API key through a normal Python dict lookup. The multi-tenancy spec added in this commit explicitly requires constant-time API key comparison, but dict lookup/equality is key-dependent and does not provide that property, so this backend should compare candidate keys with a constant-time primitive (or store and compare keyed digests) rather than indexing directly by the secret.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/server documentation Improvements or additions to documentation feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] 拆分 chatrs 让一个集群能承载多个命名空间沙箱

1 participant